-
Notifications
You must be signed in to change notification settings - Fork 25.6k
ES|QL SAMPLE aggregation function #127629
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
55fe96f to
62de767
Compare
|
Pinging @elastic/ml-core (Team:ML) |
|
Hi @jan-elastic, I've created a changelog YAML for you. |
...te/src/main/generated-src/org/elasticsearch/compute/aggregation/SampleBooleanAggregator.java
Outdated
Show resolved
Hide resolved
| "version" }, | ||
| description = "Collects sample values for a field.", | ||
| type = FunctionType.AGGREGATE, | ||
| examples = @Example(file = "stats_sample", tag = "doc") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should have an example of the output in the docs. I'm not entirely sure the right way to hack that one up because it's non-deterministic. Maybe it's hand rolled.
I think we want that example because my first question when reading this is "can I get duplicates or do those count as distinct samples?" Mostly because I'm not good at statistics.
I do think it's interesting that SAMPLE(bool) is strictly more work than VALUES(bool). It feels like sampling shouldn't be, but it makes some sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense that SAMPLE(bool) is more work. VALUES(bool) just keeps track of two boolean values: does true exist and does false exist. SAMPLE(bool) does more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Obv, I prefer some example output too. I didn't know how to achieve that, but I'll think of something. Should've left a TODO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I've hacked up something. Not particularly proud of it, but it gets the job done.
717536b to
3c4bae7
Compare
| this.breaker = bigArrays.breakerService().getBreaker(CircuitBreaker.REQUEST); | ||
| this.sort = new BytesRefBucketedSort(breaker, "sample", bigArrays, SortOrder.ASC, limit); | ||
| this.bytesRefBuilder = new BreakingBytesRefBuilder(breaker, "sample"); | ||
| this.random = new SplittableRandom(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some notes on this SplittableRandom:
- If I replace it by
Random, I get the precommit error
Forbidden method invocation: java.util.Random#() [Use org.elasticsearch.common.Randomness#get for reproducible sources of randomness]
Using SplittableRandom instead works around this by not being on the blacklist, but that's not in the spirit of what's intended.
- If I replace it by
Randomness.get()in the constructor, I get:
java.lang.IllegalStateException: This Random was created for/by another thread (Thread[#39,TEST-SampleLongAggregatorFunctionTests.testManyInitialManyPartialFinalRunner-seed#[B3B51719A90700AD],5,TGRP-SampleLongAggregatorFunctionTests]). Random instances must not be shared (acquire per-thread). Current thread: Thread[#51,elasticsearch[test][esql_test_executor][T#1],5,TGRP-SampleLongAggregatorFunctionTests]
Even though the Aggregator is used on a single thread (I hope; otherwise there are more issues), it's created on a different thread then the thread that's actively using it.
- If I replace it by
Randomness.get()inside theaddmethod, the testSampleLongAggregatorFunctionTests::testDistributionfails. It looks like each iteration instantiates the same random generator (same seed), leading to the statistics being completely wrong.
I'm still looking into these issues. If you have any thoughts, let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK! The Lucene stuff has stuff like:
if (Thread.currentThread() != prevThread) {
prevThread = Thread.currentThread();
random = Randomness.get();
}
That might do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what you exactly mean by this. Where's this stuff exactly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also whipped up a different fix. Let me know what you think...
3c4bae7 to
d46b18f
Compare
d46b18f to
520087d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Added some questions and things to check 👀
...in/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/aggregate/Sample.java
Show resolved
Hide resolved
...in/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/aggregate/Sample.java
Outdated
Show resolved
Hide resolved
...c/test/java/org/elasticsearch/compute/aggregation/SampleBytesRefAggregatorFunctionTests.java
Show resolved
Hide resolved
.../esql/compute/src/main/java/org/elasticsearch/compute/aggregation/X-SampleAggregator.java.st
Show resolved
Hide resolved
520087d to
940b6f4
Compare
|
Pinging @elastic/kibana-esql (ES|QL-ui) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀
test/framework/src/main/java/org/elasticsearch/test/MixWithIncrement.java
Outdated
Show resolved
Hide resolved
865c90c to
c1dc208
Compare
.../esql/compute/src/main/java/org/elasticsearch/compute/aggregation/X-SampleAggregator.java.st
Show resolved
Hide resolved
* ES|QL SAMPLE aggregation function * [CI] Auto commit changes from spotless * ThreadLocalRandom -> SplittableRandom * Update docs/changelog/127629.yaml * fix yaml test * Add SampleTests * docs + example * polish code * mark generated imports * comment with algorith description * use Randomness.get() * close properly * type checks * reuse hash * regen some files * [CI] Auto commit changes from spotless --------- Co-authored-by: elasticsearchmachine <[email protected]>
* ES|QL SAMPLE aggregation function * [CI] Auto commit changes from spotless * ThreadLocalRandom -> SplittableRandom * Update docs/changelog/127629.yaml * fix yaml test * Add SampleTests * docs + example * polish code * mark generated imports * comment with algorith description * use Randomness.get() * close properly * type checks * reuse hash * regen some files * [CI] Auto commit changes from spotless --------- Co-authored-by: elasticsearchmachine <[email protected]>
* ES|QL SAMPLE aggregation function * [CI] Auto commit changes from spotless * ThreadLocalRandom -> SplittableRandom * Update docs/changelog/127629.yaml * fix yaml test * Add SampleTests * docs + example * polish code * mark generated imports * comment with algorith description * use Randomness.get() * close properly * type checks * reuse hash * regen some files * [CI] Auto commit changes from spotless --------- Co-authored-by: elasticsearchmachine <[email protected]>
* ES|QL SAMPLE aggregation function * [CI] Auto commit changes from spotless * ThreadLocalRandom -> SplittableRandom * Update docs/changelog/127629.yaml * fix yaml test * Add SampleTests * docs + example * polish code * mark generated imports * comment with algorith description * use Randomness.get() * close properly * type checks * reuse hash * regen some files * [CI] Auto commit changes from spotless --------- Co-authored-by: elasticsearchmachine <[email protected]>
* ES|QL SAMPLE aggregation function * [CI] Auto commit changes from spotless * ThreadLocalRandom -> SplittableRandom * Update docs/changelog/127629.yaml * fix yaml test * Add SampleTests * docs + example * polish code * mark generated imports * comment with algorith description * use Randomness.get() * close properly * type checks * reuse hash * regen some files * [CI] Auto commit changes from spotless --------- Co-authored-by: elasticsearchmachine <[email protected]>
* ES|QL SAMPLE aggregation function * [CI] Auto commit changes from spotless * ThreadLocalRandom -> SplittableRandom * Update docs/changelog/127629.yaml * fix yaml test * Add SampleTests * docs + example * polish code * mark generated imports * comment with algorith description * use Randomness.get() * close properly * type checks * reuse hash * regen some files * [CI] Auto commit changes from spotless --------- Co-authored-by: elasticsearchmachine <[email protected]>
* ES|QL SAMPLE aggregation function (#127629) * ES|QL SAMPLE aggregation function * [CI] Auto commit changes from spotless * ThreadLocalRandom -> SplittableRandom * Update docs/changelog/127629.yaml * fix yaml test * Add SampleTests * docs + example * polish code * mark generated imports * comment with algorith description * use Randomness.get() * close properly * type checks * reuse hash * regen some files * [CI] Auto commit changes from spotless --------- Co-authored-by: elasticsearchmachine <[email protected]> * Fix + unmute SampleTests (#127959) * Fix memory tracking of ES|QL sample agg (#128467) * Fix memory tracking of ES|QL sample agg * [CI] Auto commit changes from spotless * polish code --------- Co-authored-by: elasticsearchmachine <[email protected]> * ESQL: Unclean generated imports (#127723) This removes a ton of the tricky juggling we do for generated java files to keep the imports in order. Instead, we just live with them being out of order a little. It's not great, but it's so so so much easier than the terrible juggling we had been doing. * ESQL: Disable format checks on generated imports (#127648) This builds the infrastructure to disable spotless and some checkstyle rules on generated imports. This works around the most frustrating part of ESQL's string template generated files - the imports. It allows unused and out of order imports. This can let us remove a lot of cumbersome, tricky, and fairly useless `$if$` blocks from the templates. --------- Co-authored-by: elasticsearchmachine <[email protected]> Co-authored-by: Nik Everett <[email protected]>
|
@jan-elastic just a reminder here, that we'll need to add applies_to tags to this function eventually when it gets added per README.md#version-differentiation-in-docs-v3 :) |
|
@leemthompo Are you fixing this as well together with the other fixes? Thanks! |
|
@jan-elastic yup sure can do, what's the correct availability info for this? |
Thanks! The |
* Add applies to to ScalB function in elastic#127696 * Add applies_to to categorize, follow up to elastic#129398 * Add version info, following elastic#127629 * SAMPLE is new + GA in 9.1 elastic#127629 * add applies to for 9.2 option (cherry picked from commit 5d565b5) # Conflicts: # docs/reference/query-languages/esql/_snippets/functions/parameters/categorize.md # x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/grouping/Categorize.java
* Add applies to to ScalB function in #127696 * Add applies_to to categorize, follow up to #129398 * Add version info, following #127629 * SAMPLE is new + GA in 9.1 #127629 * add applies to for 9.2 option (cherry picked from commit 5d565b5) # Conflicts: # docs/reference/query-languages/esql/_snippets/functions/parameters/categorize.md # x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/grouping/Categorize.java
* Add applies to to ScalB function in elastic#127696 * Add applies_to to categorize, follow up to elastic#129398 * Add version info, following elastic#127629 * SAMPLE is new + GA in 9.1 elastic#127629 * add applies to for 9.2 option
No description provided.